Team 1
07 January 2021
1. Examining Continuous Variables
2. Looking for Structure: Dependency Relationships and Associations
3. Investigating Multivariate Continuous Data
qplot automatically chooses the bin width of the histogram, which shows a bimodal distribution with tails on both sides.
data(galaxies)
galaxies <- as.data.frame(galaxies)
names(galaxies) <- 'Velocity'
par(fig=c(0,1,0,1),new=T)
qplot(galaxies$Velocity) +
labs(title='Histogram of Galaxy Velocity',
x='Velocity of Galaxy',
y='Frequency')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The density plot of the model shows three distinct superclusters, with the far-right tail being less distinct.
library(mclust)
mod <- Mclust(galaxies$Velocity)
par(fig=c(0,1,0,1),new=T)
plot(mod, what="density")
In order to present all the information, I think we need at least five different plots to spot all the features this data set can provide. Boxplots, histograms, rug plots and dot plots each provide different information.
There are several different histogram forms, each telling a separate story. Default binwidths, dividing each variable’s range by 30, have been used. Other scalings could reveal more information and would be more interpretable. It is interesting that the vertical scales vary from maxima of 40 to over 400. Plotting histograms individually and choosing binwidths and scale limits are the main decisions to be taken.
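As a sketch of that decision, the binwidth can be set explicitly instead of relying on the default 30 bins (the value of 1000 km/s below is an illustrative choice for the galaxies data, not one the text prescribes):

```r
# Illustrative sketch: an explicit binwidth instead of the default 30 bins.
# The binwidth of 1000 (km/s) is an assumption chosen for demonstration.
library(ggplot2)
galaxies <- as.data.frame(MASS::galaxies)
names(galaxies) <- 'Velocity'
ggplot(galaxies, aes(Velocity)) +
  geom_histogram(binwidth = 1000) +
  labs(title = 'Histogram of Galaxy Velocity',
       x = 'Velocity of Galaxy', y = 'Frequency')
```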
data(survey, package="MASS")
par(fig=c(0,1,0,1),new=T)
hist(survey$Height,
xlab = 'Height',
main = "Histogram of Students' Heights",
ylab = 'Frequency')
b) Examination of national survey data on young adults shows that the separation between the distributions of men’s and women’s heights is not wide enough to produce bimodality.
library(ggplot2)
data(movies, package="ggplot2movies")
par(fig=c(0,1,0,1),new=T)
hist(movies$year[movies$length == 90 | movies$length == 7],
xlab = 'Year',
main = 'Histogram of movies of length 7 or 90 minutes, by year',
ylab = 'Nr.')
a) The histogram shows that films of both 7 minutes and 90 minutes in length occur in both periods, before 1980 and after 1980.
library(lawstat)
data(zuni, package="lawstat")
par(fig=c(0,1,0,1),new=T)
hist(zuni$Revenue,
xlab = 'Revenue',
main = 'Revenue Histogram',
ylab = 'Nr.')
I prefer a histogram for showing the lowest 5% and the highest 5%.
There is no “h39b.W1” variable in the CHAIN data because it has been renamed “log_virus”. In both cases I would use a histogram, because it easily shows the count for each case.
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus,
xlab = 'Case',
main = 'Histogram of virus cases with 0s',
ylab = 'Nr.')
library(mi)
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus[CHAIN$log_virus != 0],
xlab = 'Case',
main = 'Histogram of virus cases without 0s',
ylab = 'Nr.')
A diamond’s weight can be found in the “carat” variable.
I wanted to compare the weight of a diamond with its price. Apparently the most expensive diamonds weigh between 1.5 and 3 carats, while some of the cheapest diamonds weigh the least.
data(diamonds, package="ggplot2")
par(fig=c(0,1,0,1),new=T)
hist(diamonds$price,
xlab = 'Price',
main = 'Histogram of Diamonds Prices',
ylab = 'Frequency')
For the distribution of diamond prices, I chose a histogram. It makes it very easy to see that the most expensive diamonds are the fewest. One factor is that such diamonds are very rare and very hard to find; another is that they require more work than the others.
Figure 5.7
data(movies, package = "ggplot2movies")
ggplot(movies, aes(votes, rating)) + geom_point() + ylim(1, 10)
(a) Excluding all films with fewer than 100 votes:
library(ggplot2)
library(dplyr)
data(movies, package = "ggplot2movies")
filtered <- filter(movies, votes > 100)
ggplot(filtered, aes(votes, rating)) + geom_point() + ylim(1, 10)
Observation: the highest rating value decreased.
(b) Excluding films with average rating greater than 8.7:
library(ggplot2)
library(dplyr)
data(movies, package = "ggplot2movies")
filtered2 <- filter(movies, rating < 8.7 )
ggplot(filtered2, aes(votes, rating)) + geom_point() + ylim(1, 10)
In order to exclude the movies with the fewest votes, we can filter out those that exceed a certain high rating, since the movies with the highest ratings usually have the fewest votes and are therefore unrepresentative.
We could do the following things:
we could keep the movies with many votes, which represent the outliers on the right of the plot, and take only those not exceeding a rating of 9;
or, if we do not want to take those outliers into consideration, we could narrow the permitted range by choosing only those with a rating below 8.7. This way there are fewer movies with very few votes (the vertical stripe seen before).
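A sketch combining both suggestions (the cut-offs of 100 votes and a rating of 9 mirror the values discussed above):

```r
library(ggplot2)
library(dplyr)
data(movies, package = "ggplot2movies")
# keep films with enough votes, but cap the rating to drop the sparse top end
filtered3 <- filter(movies, votes > 100, rating < 9)
ggplot(filtered3, aes(votes, rating)) + geom_point() + ylim(1, 10)
```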
(a) Number of observations in each experimental group (n.e) against the corresponding number of observations in each control group (n.c):
Observations:
there is a linear relationship between the variables;
there are 4 outliers at higher values of both variables (around 2500, 6000 and 8500);
there are several gaps in the dataset, between the outliers previously mentioned;
there seems to be some overplotting at the lower values.
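The plot behind the observations above could be produced as follows (a sketch, using the same Olkin1995 columns n.exp and n.cont as in part (b) below):

```r
library(ggplot2)
data(Olkin1995, package = "meta")
# full range of group sizes, before restricting to fewer than 100 patients
ggplot(Olkin1995, aes(n.exp, n.cont)) + geom_point()
```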
(b) Restricting the scatterplot to only those with less than 100 patients in each group:
data(Olkin1995, package = "meta")
ggplot(Olkin1995, aes(n.exp, n.cont)) + geom_point() + ylim(1, 100) + xlim(1, 100)
(a) Scatterplot of average revenue per pupil (Revenue) against the corresponding number of pupils (Mem):
Observations:
there are two outliers for higher values for the Mem variable;
there are 4 outliers with Mem values close to 0 and larger Revenue values;
there is a certain interval in which the values are situated and it may be a case of overplotting, as there are 420 rows in the dataset and only a few points on the scatterplot.
(b) Plotting against log of the number of pupils preserves the order of the observations while making outliers less extreme. So the log transform enhances the visualization.
Logging also revenue per pupil adds no other insight to the scatterplot, as shown below:
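A minimal sketch of the two transforms, assuming the zuni data from lawstat (Mem = number of pupils, Revenue = revenue per pupil):

```r
library(ggplot2)
data(zuni, package = "lawstat")
# (b) log of the number of pupils
ggplot(zuni, aes(log(Mem), Revenue)) + geom_point()
# logging revenue per pupil as well adds no further insight
ggplot(zuni, aes(log(Mem), log(Revenue))) + geom_point()
```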
(a) Scatterplot of the heights:
There are some outliers, at both higher and lower values of the variables, but it is hard to determine which points they are.
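The basic scatterplot for part (a) could be drawn as follows (a sketch using the same father.son data as in part (b)):

```r
library(ggplot2)
data(father.son, package = "UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point()
```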
(b) Including both points and highest density regions:
data(father.son, package = "UsingR")
par(mar = c(3.1, 4.1, 1.1, 2.1))
with(father.son,
hdr.boxplot.2d(
fheight,
sheight,
show.points = TRUE,
prob = c(0.01, 0.05, 0.5, 0.75)
))
After using a density estimate, it is easier to identify some outliers outside the contours: 3 at lower values and 2-3 at higher values, all bivariate.
Note: mar is a numeric vector of length 4 that sets the margin sizes in the order bottom, left, top, right. The default is c(5.1, 4.1, 4.1, 2.1).
(c) Fitting a linear model to the data and a loess smooth:
data(father.son, package = "UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() + geom_smooth(method =
"lm", colour = "red") + geom_abline(slope = 1, intercept = 0)
## `geom_smooth()` using formula 'y ~ x'
A nonlinear model is not necessary, as the two curves are almost identical:
data(father.son, package = "UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method = "lm",
colour = "red",
se = FALSE) +
stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A subset of Roberts’ bank sex discrimination dataset from 1979 is available in the package Sleuth2 under the name case1202.
(a) Scatterplot matrix of the three variables: Senior, Age and Exper:
data(case1202, package = "Sleuth2")
par(mar = c(1.1, 1.1, 1.1, 1.1))
ggpairs(
case1202[, c(4:5, 7)],
title = "Bank discrimination",
diag = list(continuous = 'density'),
axisLabels = 'none'
)
Observations:
experience increases with age (these two variables are positively associated, with a high correlation coefficient of 0.798). Beyond roughly 400 months of Age and 100 months of Experience there appear to be some outliers and the points scatter;
there doesn’t seem to be any association between Seniority and the other two variables, as the correlation coefficient is close to zero (there is no linear correlation between the variables);
there are more young and inexperienced employees.
(b) Scatterplots involving seniority do not have the structure of the scatterplot of experience against age because seniority is not associated with either of the other two variables.
Figure 5.8.
data(Cars93, package = "MASS")
ggplot(Cars93, aes(Weight, MPG.city)) + geom_point() +
geom_smooth(colour = "green") + ylim(0, 50)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Note: fuel economy decreases with weight quite quickly initially and then more slowly.
Plotting 1/MPG.City (litres per 100 km instead of miles per gallon) against Horsepower:
data(Cars93, package = "MASS")
ggplot(Cars93, aes((1 / MPG.city), Horsepower)) + geom_point() + geom_smooth(colour = "green")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Plotting the two variables against each other gives a roughly linear relationship: the more horsepower a car has, the more fuel it consumes per distance (a higher 1/MPG.city). But beyond a certain value (at about 0.06), some cars with less horsepower need more fuel to gain speed.
The outliers represent the cars that have a lot more horsepower than the others.
The leafshape dataset in the DAAG package includes three measurements on each leaf (length, width, petiole) and the logarithms of the three measurements.
(a) Sploms for the two sets of three variables
library(GGally)
library(ggplot2)
data(leafshape, package = "DAAG")
par(mar = c(1.1, 1.1, 1.1, 1.1))
ggpairs(
leafshape[, c(1:3)],
title = "Standard Leaf measurements",
diag = list(continuous = 'density'),
axisLabels = 'none'
)
ggpairs(
leafshape[, c(7:5)],
title = "Logarithmic Leaf measurements",
diag = list(continuous = 'density'),
axisLabels = 'none'
)
In the first set of variables there are many points with the same value for leaf length (bladelen) and width (bladewid). That is why there is so much overplotting in the first scatterplot matrix.
For the second set of variables, the log transformation preserves the order of the observations while making outliers less extreme. So the log transform will enhance the visualization.
I consider the second one more useful as it shows more insight and provides a way to avoid plotting points that represent more than one case.
(b) Coloring the cases by the variable arch, describing the leaf architecture:
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:lawstat':
##
## levene.test
data(leafshape, package = "DAAG")
par(mar = c(1.1, 1.1, 1.1, 1.1))
spm(
leafshape[c(1:3)],
pch = c(16, 16),
diagonal = "histogram",
smoother = FALSE,
reg.line = FALSE,
groups = leafshape$arch
)
library(car)
data(leafshape, package = "DAAG")
par(mar = c(1.1, 1.1, 1.1, 1.1))
spm(
leafshape[c(7:5)],
pch = c(16, 16),
diagonal = "histogram",
smoother = FALSE,
reg.line = FALSE,
groups = leafshape$arch
)
By doing this, we can now observe two clusters, based on the arch variable, in each scatterplot of the matrix.
(a) Scatterplot matrix of the eight continuous variables representing fatty acids:
library(GGally)
library(ggplot2)
data(olive, package = "zenplots")
ggpairs(
olive[, c(3:10)],
title = "Olive acids",
diag = list(continuous = 'density'),
axisLabels = 'none'
)
(b) There might be some outliers depending on different variables:
for stearic, there are outliers at higher values on the x-axis when plotting against all the other variables;
there is one outlier in every plot of palmitic against the other variables (column one of the matrix), usually with a lower value on the x-axis (except in the plot against oleic, where the two are negatively associated);
there is one outlier in every plot of oleic against the other variables (column four of the matrix), usually with a higher value on the x-axis (except against the first three variables);
there are 2-3 outliers in every plot of linolenic against the others, at high values.
Other observations:
For the plots against the eicosenoic variable we can see that the most frequent value (the mode) is 0, because most points have a y coordinate equal to 0.
For the linolenic variable we can observe some gaps in the distribution of y, especially at lower values;
(a) Splom of all continuous variables, except chas:
library(GGally)
library(ggplot2)
data(Boston, package = "MASS")
ggpairs(
Boston[,-c(4)],
title = "Boston housing",
diag = list(continuous = 'density'),
axisLabels = 'none'
)
The variable most clearly positively associated with medv is rm, with a correlation coefficient of 0.695.
(b)
Several scatterplots involving the variable crim have an unusual form. The reason might be that there are only a few suburbs in Boston in which the crime rate is really high, and therefore the points with bigger values have a lower frequency and are more isolated.
Also, the scatterplots involving the variable crim are all skewed to the right.
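One way to examine this skewness (the log transform here is my suggestion, not part of the original answer):

```r
library(ggplot2)
data(Boston, package = "MASS")
# a log transform spreads out the heavily right-skewed crime rates
ggplot(Boston, aes(log(crim), medv)) + geom_point()
```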
(c) Observations
the variables are negatively associated;
there are a few outliers at larger values of lstat;
there are no observations with a high percentage of lower-status population (lstat) and a high median value of owner-occupied homes (medv);
for larger values of medv, the spread of the lstat values increases.
the scatterplot is skewed to the right;
there are generally points with low rates of crime, and there doesn’t seem to be a relationship between the proportion of blacks and the crime rate;
there are a lot of outliers with both high and low values, but one has a considerably high crime rate and also a bigger proportion of blacks
rad takes only integer values, as it represents an index, therefore the scatterplot is represented by horizontal lines of points, corresponding to the observations with the y-axis equal to the index;
the majority of the observations have an index of 1-8, but there are some cases with a higher index of 24. So the interval [9,23] on the y-axis represents a gap;
there seems to be no association between these two variables; the points are fairly evenly distributed across the intervals.
suburbs with very high lstat have older houses on average;
there are a few outliers, for both younger and older houses;
they are positively associated, with a correlation coefficient of 0.544;
there are some values of the tax variable that are much more common, which is why there are several vertical lines on the plot. The suburbs with a tax value of approx. 680 may have any average lstat from the lowest to the highest;
there are 4 outliers at larger values of the tax variable (approx. 720);
there are also a lot of outliers for the suburbs with very high property tax.
The x-axis is BV and the y-axis is V.
library('GGally')
# (a)
ggparcoord(data = swiss, columns=c(1:6), scale="uniminmax", alphaLines=0.2) +
xlab("") + ylab("")
(b) There might be some outliers depending on different variables:
(c) The variable Catholic appears to have 2 modes, one at the lowest range (0 - 0.25) and one at the highest end (0.8 - 1.0). So in Switzerland a province usually has either a large Catholic majority or a large non-Catholic majority.
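The two modes can also be checked directly with a histogram of the raw variable (a sketch; the binwidth of 10 percentage points is an illustrative choice):

```r
library(ggplot2)
# swiss is a built-in dataset; Catholic is a percentage from 0 to 100
ggplot(swiss, aes(Catholic)) + geom_histogram(binwidth = 10)
```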
# (d)
swiss1 <- within(swiss,
catholics_level <- factor(ifelse(Catholic > 80, 'High', 'Low')))
ggparcoord(data = swiss1[order(swiss1$Catholic),], columns=c(1:6), scale="uniminmax",
groupColumn="catholics_level", alphaLines=0.5) +
xlab("") + ylab("") +
theme(legend.position = "none")
(d) The provinces with a high level of Catholic appear to have a higher Fertility index and lower levels of Examination (i.e. % draftees receiving the highest mark on the army examination) and Education. The Infant.Mortality variable does not look much affected by whether the province has a Catholic or non-Catholic majority.
(a) In this pcp we could see different details:
## [1] 0.53 0.56 0.60 0.63 0.67 0.67 0.67 0.67 0.68 0.72 1.50 1.56 1.62 1.62 1.63
## [16] 1.65 1.67 1.81 1.82 1.83 1.83 1.86 1.92 1.94 1.94 1.99 2.00 2.05 2.06 2.06
## [31] 2.33 3.43 3.47 3.77 3.88 3.94 4.26 4.30 4.52 5.34 5.51 5.64 5.69 5.91 7.23
pottery1 <- within(pottery,
mgo_level <- factor(ifelse(MgO < 1, 'Low', 'High'))) # use the 1.0 threshold here
ggparcoord(data = pottery1[order(pottery1$MgO),], columns=c(1:9), scale="uniminmax",
groupColumn="mgo_level", alphaLines=0.5) +
xlab("") + ylab("") +
theme(legend.position = "none")
(b) On the other variables, the cases with low MgO also have lower Fe2O3, CaO, Na2O, K2O and MnO than the other cases. Also, some of these cases have higher values of Al2O3 and TiO2.
# (c)
ggparcoord(data = pottery, columns=c(1:9), scale="uniminmax",
groupColumn="kiln", alphaLines=0.5) +
xlab("") + ylab("") + geom_line(size=0.7)
(c) In this pcp we can see some differences between the kilns as follows:
data("oliveoil")
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax", alphaLines=0.2) +
xlab("") + ylab("")
(a) Some of the observed features in this pcp:
# (b)
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax",
groupColumn="region", alphaLines=0.7) +
xlab("") + ylab("")
(b) In this pcp we can find that:
(c) For the scatterplot matrix down below:
While for a pcp:
ggpairs(oliveoil[,c(3:10)], title="Olive acids", diag=list(continuous='density'), axisLabels='none')
data(Cars93, package="MASS")
col_indices = which(names(Cars93) %in% c('Price', 'MPG.city', 'MPG.highway', 'Horsepower', 'RPM', 'Length', 'Width', 'Turn.circle', 'Weight'))
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax", alphaLines=0.7) +
xlab("") + ylab("")
(a) In this plot we could conclude the following:
(b) I would plot a pcp (down below), and here we could observe some differences between USA and non-USA cars:
# (b)
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax",
groupColumn="Origin", alphaLines=0.7) +
xlab("") + ylab("")
(c) Yes, a pcp with uniminmax scaling is informative, since we managed to extract some insights from it in (a) and (b). Below is the same pcp with a standard scale applied (subtracting the mean and dividing by the standard deviation on each axis) and with the observations coloured by the number of Cylinders. Here we can also see that:
# (c)
ggparcoord(data = Cars93, columns=col_indices,
groupColumn="Cylinders", alphaLines=0.7) +
xlab("") + ylab("") + geom_line(size=0.75)
data(bodyfat, package="MMST")
ggparcoord(data = bodyfat, columns=1:15, scale="uniminmax", alphaLines=0.7) +
xlab("") + ylab("")
(a) There is clearly one outlier which reaches the maximum value 1.0 on the uniminmax scale for many variables (bodyfat, weight, neck, chest, abdomen, hip, thigh, knee, biceps, wrist). Since he is not the tallest man in the sample (judging by his height), I would say this outlier is not an athlete but rather a person with serious obesity problems.
In particular, there are 2 more outliers on the ankle axis, who might also be outliers on the hip, abdomen and chest measurements.
(b) The height variable has many points of concentration, as if it were a categorical variable. Perhaps the reason is that height was measured to only one decimal place instead of a higher precision. We can quickly check this below:
# By looking at some of the observations, it actually appears that the data
# is recorded to the nearest quarter (0.00/0.25/0.50/0.75).
bodyfat$height[1:10]
## [1] 67.75 72.25 66.25 72.25 71.25 74.75 69.75 72.50 74.00 73.50
So the reason for this “categorical behaviour” is indeed the low precision of the recorded values.
(c) As seen in the pcp, as the density increases, the bodyfat decreases, and vice-versa. So there is clearly a negative correlation between those 2 variables. This is also suggested by the intersection of all profiles in (what appears to be) a single point.
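A quick numerical check of that claim (assuming the MMST bodyfat data has columns named density and bodyfat, as the pcp labels suggest):

```r
data(bodyfat, package = "MMST")
# the coefficient should be strongly negative if the pcp reading is right
cor(bodyfat$density, bodyfat$bodyfat)
```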
(d) Yes, the ordering of the variables can affect the pcp display. For this dataset we can try ordering the variables by their medians, so that all the measurements run from highest to lowest and give a better overview of the data.
medians = apply(bodyfat[, 1:15], 2, median, na.rm=TRUE)
ordered_medians_indexes = order(medians)
ggparcoord(data = bodyfat, alphaLines=0.3,
scale="globalminmax", order=ordered_medians_indexes) + coord_flip()
Here we can see, for example, that the wrist has the smallest measurement of all the body parts and that the chest is the largest part of a man’s body.
(a) I think a good pcp to present to others would be the following, with globalminmax scale and y limits 0-100, where we can deduce that:
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.7) +
xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))
We could try to plot a pcp highlighting the students who scored better on the Vectors exam. In this pcp we can observe that:
mathmarks1 <- within(mathmarks,
vectors_perf <- factor(ifelse(vectors > 60, 'Good', 'Bad')))
ggparcoord(data = mathmarks1, columns=1:5, alphaLines=0.7, scale="globalminmax",
groupColumn='vectors_perf') +
xlab("") + ylab("") + theme(legend.position = "none") + coord_cartesian(ylim=c(0,100))
# (b)
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.1, boxplot=TRUE) +
xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))
(b) The boxplots make it easier to see the maximum, minimum and median values and the outliers for each exam. Looking at this plot, we cannot tell for sure that Mechanics and Vectors were the closed-book exams. We can say for sure that the Statistics exam (an open-book exam) has the lowest median mark of all the subjects, which somewhat contradicts the idea of it being an open-book exam.
However, the minimum value (~3 points) of the Mechanics exam is the lowest among all subjects, which is consistent with it having been a closed-book exam.
Regarding the polygonal lines, I think it is not mandatory to draw them in a boxplot pcp in this scenario; even with the alphaLines option they just fill the space without adding useful information, as can be seen above.
(a) By using the following pcp’s, which separate the profiles by the wine class (red - Barbera, green - Barolo, blue - Grignolino), we can easily distinguish which variables help us by classifying different wines:
data(wine, package="MMST")
a = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("red","grey", "grey")) + coord_flip()
b = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","green", "grey")) + coord_flip()
c = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "blue")) + coord_flip()
grid.arrange(a, b, c, nrow=1)
(b) There are certainly some outliers in the data, for example the extreme high/low values in the blue pcp for the MalicAcid, Ash, Flav and Hue variables.
(c) Yes, there might be some evidence that these classes have subgroups of wines inside them. Some examples would be:
data(Boston, package="MASS")
hcluster = hclust(dist(Boston), method='ward.D2')
clu4 = cutree(hcluster, k=4)
clus = factor(clu4)
boston1 = cbind(Boston, clus)
# (a)
ggparcoord(boston1, columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("")
# (b)
a = ggparcoord(boston1[which(boston1$clus == 1),], columns=1:14, scale="uniminmax",
mapping=aes(color='#f5aca7')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#f5aca7"))
b = ggparcoord(boston1[which(boston1$clus == 2),], columns=1:14, scale="uniminmax",
mapping=aes(color='#a8c75a')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#a8c75a"))
c = ggparcoord(boston1[which(boston1$clus == 3),], columns=1:14, scale="uniminmax",
mapping=aes(color='#1ec5c9')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#1ec5c9"))
d = ggparcoord(boston1[which(boston1$clus == 4),], columns=1:14, scale="uniminmax",
mapping=aes(color='#cc8afd')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#cc8afd"))
grid.arrange(a, b, c, d)
# (c)
a = ggparcoord(rbind(boston1[which(boston1$clus != 1),], boston1[which(boston1$clus == 1),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#f5aca7","grey", "grey", "grey"))
b = ggparcoord(rbind(boston1[which(boston1$clus != 2),], boston1[which(boston1$clus == 2),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","#a8c75a", "grey", "grey"))
c = ggparcoord(rbind(boston1[which(boston1$clus != 3),], boston1[which(boston1$clus == 3),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "#1ec5c9", "grey"))
d = ggparcoord(rbind(boston1[which(boston1$clus != 4),], boston1[which(boston1$clus == 4),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "grey", "#cc8afd"))
grid.arrange(a, b, c, d)
The first plot could be useful when little space is available; however, with 4 clusters to display it quickly becomes hard to distinguish the categories, and the plot turns into a messy mix of colours.
The second way, plotting each cluster individually, looks much cleaner than the others. However, the shapes are not the same as in the other two plots, because plotting the clusters separately changes the minimum and maximum values on each axis, restricting them to the cluster’s own range. With a uniminmax scale this makes comparisons between clusters difficult; a globalminmax scale would be more appropriate for that kind of task.
The third option, plotting each cluster individually with the others in the background, is a clean way of visualizing the groups while also comparing their differences. It preserves the axis limits, and outliers are much easier to spot. If I had to choose, this is how I would display my clustering results.